fix: restore Arrow nullable flags lost in Polars round-trips (ENG-375)#130
Conversation
Polars converts all Arrow fields to nullable=True when producing its Arrow output, corrupting schema intent for non-optional columns. Add `arrow_utils.restore_schema_nullability()`, which reinstates the original nullable flag for each field by name using the reference schemas captured before any Polars operation. This is schema-intent-based (unlike `infer_schema_nullable`, which is data-based and would incorrectly mark Optional columns as non-nullable when they happen to have no nulls).

Apply the fix at all Polars round-trip sites:
- FunctionNode: all 5 join sites and the as_table sort
- Join.static_process: the per-iteration inner join loop (removes the previous infer_schema_nullable workaround)

Tests added (red-green-refactor):
- Unit tests for restore_schema_nullability, including the Optional[T] correctness case that infer_schema_nullable fails
- Integration tests for FunctionNode.get_all_records, FunctionNode._iter_all_from_database, and Join.op_forward

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add TestJoinTagColumnNullability with three cases:

1. Shared tag columns with mixed nullable flags — both "id" (nullable=False) and "group" (nullable=True, Optional, no actual nulls) are preserved through the Polars inner join on those shared keys.
2. Non-shared tag columns from each side of a cartesian-product join — "id" (non-nullable) from source1 and "category" (nullable) from source2 both retain their flags in the result.
3. Three-way join (two Polars iterations) — verifies that restore_schema_nullability is applied correctly at each iteration, not just the last.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…creation add_system_tag_columns and add_source_info were appending columns with table.append_column(string_name, array), which Arrow defaults to nullable=True. Those columns are always computed and never null, so nullable=False is the correct schema intent.

This caused a regression in TestJoinOutputSchemaSystemTags after restore_schema_nullability replaced infer_schema_nullable in Join: restore_schema_nullability faithfully preserves nullable=True for these columns (from the reference schema captured before the Polars join), while the operator's output_schema prediction expected nullable=False.

Fix: pass an explicit pa.field(..., nullable=False) to append_column in both functions, so the schema is correct from the moment of creation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
This PR addresses a schema-integrity regression where Polars round-trips (Arrow -> Polars -> Arrow) drop Arrow field nullability intent by outputting all fields as nullable=True. It introduces a schema-intent-based restoration step that re-applies the original nullable flags (and field types/metadata) using reference schemas captured before Polars operations.
Changes:
- Added `arrow_utils.restore_schema_nullability()` to restore Arrow field nullability (and field type/metadata) from one or more reference schemas.
- Updated `FunctionNode` and `Join.static_process` to capture pre-Polars schemas and restore nullability after Polars joins/sorts (replacing the prior data-based `infer_schema_nullable` workaround in `Join`).
- Added unit + integration tests covering round-trip nullability preservation, including the "Optional[T] with no nulls" case.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/orcapod/utils/arrow_utils.py | Adds restore_schema_nullability() and explicitly marks computed/system/source columns as non-nullable. |
| src/orcapod/core/operators/join.py | Restores schema nullability after Polars join; removes infer_schema_nullable cast. |
| src/orcapod/core/nodes/function_node.py | Restores schema nullability after Polars joins/sorts in multiple code paths. |
| tests/test_data/test_polars_nullability/test_restore_schema_nullability.py | Unit tests for restore_schema_nullability, plus demonstrations of the Polars behavior. |
| tests/test_data/test_polars_nullability/test_function_node_nullability.py | Integration tests ensuring FunctionNode + Join preserve nullable intent across joins. |
| tests/test_data/test_polars_nullability/__init__.py | Package marker for the new test module. |
```python
import pyarrow as pa
import polars as pl
import pytest
```
pytest is imported but not used in this test module. Please remove it to avoid unused-import noise (and potential lint failures if ruff/pyflakes is enabled in CI).
```diff
-import pytest
```
```python
assert joined.schema.field("score").nullable is True


class TestRestoreSchemaHullability:
```
Typo in class name: TestRestoreSchemaHullability should be TestRestoreSchemaNullability (or similar) to match the helper being tested.
```diff
-class TestRestoreSchemaHullability:
+class TestRestoreSchemaNullability:
```
```python
import pyarrow as pa
import pytest

import orcapod as op
from orcapod.core.nodes.function_node import FunctionNode
from orcapod.databases import InMemoryArrowDatabase
```
pytest is imported but not used in this module. Please remove it to avoid unused-import noise (and potential lint failures if linting is enabled).
src/orcapod/utils/arrow_utils.py
Outdated
```diff
@@ -1093,7 +1170,11 @@ def add_source_info(
         [f"{source_val}::{col}" for source_val in source_column],
         type=pa.large_string(),
     )
-    table = table.append_column(f"{constants.SOURCE_PREFIX}{col}", source_column)
+    # Source info columns are always computed strings, never null.
```
source_column is overwritten inside the per-column loop, so the source-info token accumulates previous column names (e.g., src::col1::col2) rather than staying src::col2 for the col2 source column. Preserve the original per-row source_info values (e.g., keep a separate base list/array) and build each _source_{col} column from that base.
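The accumulation bug generalizes to any loop that overwrites its own input. Stripped of the Arrow details and using hypothetical values, the pattern and its fix look like this:

```python
# Buggy pattern: the loop variable is overwritten with the derived list,
# so iteration N+1 builds on iteration N's output instead of the original.
source = ["src"] * 2
buggy = {}
for col in ("col1", "col2"):
    source = [f"{v}::{col}" for v in source]  # reuses the derived values
    buggy[col] = source

# Fix: derive every column from a base captured once before the loop.
base_source = ["src"] * 2
fixed = {}
for col in ("col1", "col2"):
    fixed[col] = [f"{v}::{col}" for v in base_source]
```

With the bug, the second column's tokens come out as "src::col1::col2"; with the base captured up front, each column is independently "src::col2".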
```python
if column_config.sort_by_tags:
    output_table_schema = output_table.schema
    output_table = (
        pl.DataFrame(output_table)
        .sort(by=self.keys()[0], descending=False)
        .to_arrow()
    )
    output_table = arrow_utils.restore_schema_nullability(output_table, output_table_schema)
```
This PR restores nullability after Polars .sort(...).to_arrow() in FunctionNode.as_table, but there is another Polars sort round-trip in FunctionPodStream.as_table (src/orcapod/core/function_pod.py, around the column_config.sort_by_tags block) that still drops nullable flags. Consider applying restore_schema_nullability there too for consistent behavior across stream/table materialization paths.
- Remove unused `import pytest` from both test modules
- Rename TestRestoreSchemaHullability → TestRestoreSchemaNullability (typo fix)
- Fix source_column accumulation bug in add_source_info: each _source_<col> column was being built from the array produced for the previous column, causing tokens like "src::col1::col2" instead of "src::col2". Introduce base_source, captured once before the loop, so every column is independently derived from the original per-row source tokens.
- Apply restore_schema_nullability after the Polars sort in FunctionPodStream.as_table (function_pod.py) to match the same fix already applied in FunctionNode.as_table, covering the stream materialization path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addressed all five review comments in commit 3218c4c:
1 & 3 — Unused import pytest
Removed from both test_restore_schema_nullability.py and test_function_node_nullability.py.
2 — Typo in class name
TestRestoreSchemaHullability → TestRestoreSchemaNullability.
4 — source_column accumulation bug in add_source_info
Real bug — thanks for catching it. The loop was reassigning source_column to the pa.array built for each column, so the next iteration started from already-appended values (e.g. "src::col1::col2" instead of "src::col2"). Fixed by introducing base_source (captured once before the loop) and building each per-column array from that base independently.
5 — Missing restore_schema_nullability in FunctionPodStream.as_table
Applied the same pattern (capture schema before, restore after) to the sort_by_tags Polars sort in function_pod.py, consistent with the equivalent fix in FunctionNode.as_table.
…375) TestAddSourceInfo in test_arrow_utils.py covers four cases:

1. Single column — baseline: _source_x = "src::x".
2. Multi-column — the regression case: each _source_<col> must equal "base::<col>", not "base::prev_col::<col>". With the bug, _source_y would be "base::x::y" because source_column was reused across iterations.
3. Per-row source tokens — same accumulation bug with a list input: _source_b row 0 must be "src0::b", not "src0::a::b".
4. Column count — one _source_<col> per data column, no more.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…G-375) Remove assertion messages and docstring prose that described the buggy output pattern. Tests now simply assert the exact contract — each _source_<col> value is '<source_token>::<col>' — which is sufficient to catch the regression without mentioning what the wrong value would be.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
```diff
 if source_info is None or isinstance(source_info, str):
-    source_column = [source_info] * table.num_rows
+    base_source = [source_info] * table.num_rows
 elif isinstance(source_info, Collection):
     if len(source_info) != table.num_rows:
```
base_source is only assigned for source_info is None/str or when source_info is a Collection; any other runtime type will fall through and later raise an UnboundLocalError when building col_source. Add an explicit else: raise TypeError(...) (or normalize other iterables) so callers get a clear error and the function can’t crash with an unbound local.
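The fix can be sketched with a hypothetical helper mirroring the branching; names are illustrative, not the actual add_source_info code:

```python
from collections.abc import Collection

def normalize_source_info(source_info, num_rows: int) -> list:
    """Hypothetical mirror of add_source_info's input normalization."""
    if source_info is None or isinstance(source_info, str):
        # One shared token (or None) repeated for every row.
        return [source_info] * num_rows
    elif isinstance(source_info, Collection):
        if len(source_info) != num_rows:
            raise ValueError(
                f"expected {num_rows} source values, got {len(source_info)}"
            )
        return list(source_info)
    else:
        # Without this branch, an unsized iterable (e.g. a generator) would
        # fall through and surface later as an UnboundLocalError.
        raise TypeError(
            "source_info must be None, str, or a sized collection, "
            f"not {type(source_info).__name__}"
        )
```

Generators fail the `Collection` check because they lack `__len__`, which is exactly the case the reviewer flagged.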
Fixed in d4bcd72 — added else: raise TypeError(...) after the elif branch so any unsized iterable (e.g. a generator) gets a clear error instead of an UnboundLocalError.
src/orcapod/core/operators/join.py
Outdated
```python
# field names, preserving nullable flags — must be done BEFORE the
# Polars rename call below, which loses all nullability information.
next_ref_schema = pa.schema([
    pa.field(rename_map.get(f.name, f.name), f.type, nullable=f.nullable, metadata=f.metadata)
    for f in next_table.schema
])

if rename_map:
    next_table = pl.DataFrame(next_table).rename(rename_map).to_arrow()
```
When rename_map is non-empty, this does an Arrow→Polars→Arrow round-trip just to rename, and then immediately converts next_table to Polars again for the join. This is an avoidable extra conversion (and can be expensive for large tables). Consider renaming in Arrow (compute new name list and call next_table.rename_columns(...)) or keep a pl.DataFrame(next_table).rename(rename_map) and pass that directly into the join without converting back to Arrow in between.
```diff
-# field names, preserving nullable flags — must be done BEFORE the
-# Polars rename call below, which loses all nullability information.
+# field names, preserving nullable flags before the Polars join,
+# which loses original nullability information.
 next_ref_schema = pa.schema([
     pa.field(rename_map.get(f.name, f.name), f.type, nullable=f.nullable, metadata=f.metadata)
     for f in next_table.schema
 ])
 if rename_map:
-    next_table = pl.DataFrame(next_table).rename(rename_map).to_arrow()
+    next_table = next_table.rename_columns([
+        rename_map.get(name, name) for name in next_table.column_names
+    ])
```
Fixed in d4bcd72 — replaced pl.DataFrame(next_table).rename(rename_map).to_arrow() with Arrow-native next_table.rename_columns([rename_map.get(name, name) for name in next_table.column_names]). Eliminates the round-trip entirely.
```python
"""
Unit tests for the restore_schema_nullability helper in arrow_utils.

RED phase: all tests in this file must fail before the helper exists.
```
The module docstring mentions a “RED phase” expectation that tests must fail before the helper exists, which will be misleading once merged (and isn’t actionable for future maintainers). Consider removing/updating this to describe the behavioral contract being tested rather than the development phase.
```diff
-RED phase: all tests in this file must fail before the helper exists.
+These tests document that Polars round-trips and joins can widen Arrow field
+nullability, and verify that restore_schema_nullability reapplies the original
+schema's nullable=False flags where appropriate.
```
Fixed in d4bcd72 — removed the RED phase phrasing; docstring now describes the behavioral contract being validated.
```python
Integration tests: FunctionNode and Join preserve non-nullable column constraints
after the Arrow → Polars → Arrow round-trip that occurs during joins.

RED phase: tests should fail before the fix is applied.
```
The docstring states these tests are “RED phase” and should fail before the fix is applied; after merge this becomes stale and potentially confusing. Consider removing that phrasing and focusing the docstring on the nullability guarantees being validated.
```diff
-Integration tests: FunctionNode and Join preserve non-nullable column constraints
-after the Arrow → Polars → Arrow round-trip that occurs during joins.
-
-RED phase: tests should fail before the fix is applied.
+Integration tests validating that FunctionNode and Join preserve non-nullable
+column constraints across the Arrow → Polars → Arrow round-trip that occurs
+during joins and record retrieval.
```
Fixed in d4bcd72 — same update: RED phase line removed, docstring now describes the nullability guarantees being tested.
- join.py: replace Polars rename round-trip with Arrow-native rename_columns(), eliminating an unnecessary Arrow→Polars→Arrow conversion before the join.
- arrow_utils.py: add explicit else/TypeError in add_source_info so callers passing an unsized iterable (e.g. a generator) get a clear error instead of an UnboundLocalError from base_source being unbound.
- test_function_node_nullability.py: remove stale "RED phase" phrasing from module docstring; replace with a description of the behavioral contract being validated.
- test_restore_schema_nullability.py: same stale docstring fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Second review round addressed in commit d4bcd72:

Polars rename round-trip (join.py): Replaced the Arrow→Polars→Arrow rename with the Arrow-native rename_columns call, so no round-trip happens before the join.

Stale "RED phase" docstrings (both test files): Removed the development-phase phrasing from both module docstrings. Both now describe the behavioral contract being validated rather than the TDD workflow used to write them.
Resolve conflicts in function_node.py between our restore_schema_nullability additions and dev's _filter_by_content_hash calls. In each conflict site, apply the content-hash filter first (dev), then capture the schemas for restore_schema_nullability (ours). The first conflict also absorbs dev's rename of the join result variable from `joined` to `filtered` in _get_cached_results_for_entries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary

- Polars outputs all fields as `nullable=True` regardless of original schema, corrupting schema intent for `int` (non-optional) vs `Optional[int]` columns
- Added `arrow_utils.restore_schema_nullability()` — looks up each field by name in reference schemas captured before the Polars operation, and reinstates the original `nullable` flag. This is schema-intent-based, not data-based.
- Removed the `infer_schema_nullable` workaround in `Join.static_process` (which was data-based and would incorrectly mark `Optional[T]` columns as non-nullable when they happen to contain no nulls)
- Applied at all Polars round-trip sites in `FunctionNode` and the join loop in `Join`

Why not `infer_schema_nullable`?

`infer_schema_nullable` derives `nullable` from whether the data actually contains nulls. An `Optional[int]` column with no current null values would be marked `nullable=False` — silently wrong. `restore_schema_nullability` uses the Arrow schemas from before the Polars conversion, which encode the user's type annotations.

Test plan

- Unit tests for `restore_schema_nullability` (6 cases), including the `Optional[T]`-with-no-nulls case that `infer_schema_nullable` fails
- Integration tests for `FunctionNode.get_all_records`, `FunctionNode._iter_all_from_database`, `Join.op_forward` with both non-nullable and nullable columns
- `test_join_preserves_nullable_optional_column_with_no_nulls` specifically catches the `infer_schema_nullable` regression

Closes ENG-375
🤖 Generated with Claude Code